build: Cargo workspace + native-common extraction (1/6)#104
Conversation
…e-common Move the standalone `native` crate into a root Cargo workspace and extract shared JNI plumbing (error->exception mapping, Tokio runtime singleton, StreamingReader) into a new `datafusion-jni-common` crate under `native-common/`. `native/src/errors.rs` moves to `native-common/src/errors.rs`; the nine native modules now import error/runtime helpers from `datafusion_jni_common`. Build glue follows: single root `Cargo.lock`, `.cargo/config.toml` redirects output to `rust-target/`, Makefile/CI/poms updated to build `--workspace` and target `-p datafusion-jni`. Core javadoc build commands updated to match. Pure refactor; no behavior change. First of a 6-PR stack splitting the Spark DataSource V2 connector work. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on-jni-common Add [workspace.package] (version, edition, license, repository) to the root manifest and have both crates inherit it via `*.workspace = true`, so a single bump re-versions the workspace in lock step. Make datafusion-jni-common publishable: drop `publish = false` and add a `description` (crates.io requires it). All its dependencies are registry crates, so nothing blocks publish. datafusion-jni stays `publish = false` since it is a JVM-loaded cdylib, not a crates.io library. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `**/META-INF/services/**` and `spark/scaffold/bridge-template/**` RAT excludes guard files that do not exist until later splits in the stack. They are dead config here. Re-added in the splits that introduce the files they cover (META-INF/services in 04-spark-scala-connector, bridge-template in 05-bridge-scaffold). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `**/*.md` exclude dropped every markdown file repo-wide from license checking, not just docs. It was redundant (docs are already covered by `docs/**`) and overly broad. Removing it; RAT still reports 0 unapproved because the remaining markdown carries valid headers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
9f78869 to
9d2096c
Compare
Add a crate README so the crates.io listing has front-page content, and wire it in via `readme = "README.md"`. The ASF license header is an HTML comment (matching dev/release/README.md) so RAT approves it while it stays invisible in the rendered markdown. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address review feedback on the workspace-foundation PR: - development.md: trim the repo-layout section to the crates this PR actually ships (native, native-common). It was forward-referencing spark/, spark/bridge, datafusion-spark-bridge, and examples/native -- none of which exist until later PRs in the stack -- and called the member list "three" while listing four. Later PRs (apache#105/apache#106/apache#107/apache#109) carry notes to re-add their own slice when those dirs land. - rat_exclude_files.txt: the Rust lockfile moved to the workspace root, so the stale native/Cargo.lock entry left the root Cargo.lock with no RAT exclude for the source-tarball check (check-rat-report.py). Point it at Cargo.lock. - native-common: dedupe the panic-payload downcast -- StreamingReader::next now calls errors::panic_message instead of repeating the String/&str match inline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Cargo workspace conversion redirects build output to rust-target/
(via .cargo/config.toml), but the dev/release scripts still built from
native/ and read native/target/release/, which is no longer populated
even when cargo runs inside native/ (config is discovered up-tree).
- build-native-libs.sh / build-release.sh: build from the repo root with
`-p datafusion-jni` and copy from rust-target/{release,<triple>/release}/.
- verify-release-candidate.sh: run `cargo fmt --all` workspace-wide so the
new native-common crate is covered (matches CI lint.yml).
- updating-datafusion-version.md: list the workspace.dependencies entries
that actually exist (datafusion, -proto, -spark, -substrait); drop the
stray datafusion-ffi reference.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
No release script or doc describes publishing the Rust crates to crates.io, and the crate is an internal implementation detail of the native crates (its README already says the API may change to track their needs). Match `publish = false` on datafusion-jni so an accidental `cargo publish` can't push it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@paleolimbot @coderfender Here is the first of six stacked PRs to add a spark data source that uses a datafusion table provider underneath. This one only moves things around in the repo to prepare for adding in the new crate. The full original PR #103 is way too large to review effectively so I've broken it into stages. |
There was a problem hiding this comment.
Pull request overview
Refactors the Rust native build into a Cargo workspace and extracts shared JNI infrastructure into a new datafusion-jni-common crate, while updating Maven/scripts/CI to build from the workspace root and to use rust-target/ as the shared Cargo output directory.
Changes:
- Introduce a workspace root
Cargo.tomland newnative-common(datafusion-jni-common) crate; migrate shared JNI/runtime/streaming/error plumbing into it. - Repoint build outputs to
rust-target/via.cargo/config.tomland update Maven/Makefile/CI/release scripts to use workspace builds and the new artifact paths. - Update docs and test comments to reflect new build commands and workspace dependency management.
Reviewed changes
Copilot reviewed 31 out of 33 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pom.xml | Pin maven-compiler-plugin version; update RAT/license excludes for rust-target/ and root Cargo.lock. |
| Cargo.toml | Add workspace root, members, shared package metadata, and centralized [workspace.dependencies]. |
| Cargo.lock | Move to workspace root; lockfile content updated for the new workspace resolution. |
| .cargo/config.toml | Set Cargo target-dir = "rust-target" to keep Rust artifacts out of Maven-cleaned target/. |
| .gitignore | Ignore rust-target/ in addition to target/. |
| Makefile | Build/format/clean via workspace root; update runtime-metrics build command to use -p datafusion-jni. |
| native/Cargo.toml | Convert to workspace-inherited metadata/deps; add dependency on datafusion-jni-common. |
| native/src/lib.rs | Remove local errors module + inline StreamingReader; use datafusion-jni-common errors/runtime/streaming. |
| native/src/arrow.rs | Switch JNI error handling imports to datafusion-jni-common. |
| native/src/avro.rs | Switch JNI error handling imports to datafusion-jni-common. |
| native/src/cache_manager.rs | Switch JniResult import to datafusion-jni-common. |
| native/src/csv.rs | Switch JNI error handling imports to datafusion-jni-common. |
| native/src/json.rs | Switch JNI error handling imports to datafusion-jni-common. |
| native/src/object_store.rs | Switch JniResult import to datafusion-jni-common. |
| native/src/proto.rs | Switch JNI error handling imports to datafusion-jni-common. |
| native/src/runtime_metrics.rs | Switch JniResult import and update rebuild instructions for workspace package selection. |
| native/src/schema.rs | Switch JniResult import to datafusion-jni-common. |
| native-common/Cargo.toml | Add new datafusion-jni-common crate definition + avro feature forwarding. |
| native-common/src/lib.rs | Add shared runtime singleton + StreamingReader implementation. |
| native-common/src/errors.rs | Add shared error-to-exception mapping + exported panic_message. |
| native-common/README.md | Document purpose/linking model/status of the new shared native crate. |
| core/pom.xml | Update expected native library path to rust-target/... and update failure guidance text. |
| core/src/main/java/org/apache/datafusion/SessionContext.java | Update user-facing build instructions for substrait/runtime-metrics to include -p datafusion-jni. |
| core/src/test/java/org/apache/datafusion/SessionContextRuntimeStatsTest.java | Update build instructions for runtime-metrics feature to include -p datafusion-jni. |
| core/src/test/java/org/apache/datafusion/SessionContextSubstraitTest.java | Update build instructions for substrait feature to include -p datafusion-jni. |
| .github/workflows/build.yml | Update cargo cache paths/keys for workspace root lockfile + rust-target/. |
| .github/workflows/lint.yml | Run formatting/clippy at workspace root; update cargo cache paths/keys. |
| docs/source/contributor-guide/development.md | Update dev instructions and describe new workspace + rust-target/ behavior. |
| docs/source/contributor-guide/updating-datafusion-version.md | Update version-bump procedure to reflect workspace dependency pinning. |
| dev/release/build-release.sh | Update native build/copy paths and build commands for workspace + rust-target/. |
| dev/release/datafusion-java-rm/build-native-libs.sh | Build datafusion-jni from repo root and reference rust-target/... outputs. |
| dev/release/rat_exclude_files.txt | Exclude root Cargo.lock instead of native/Cargo.lock. |
| dev/release/verify-release-candidate.sh | Run cargo fmt workspace-wide instead of only in native/. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
I considered renaming |
|
Very cool! I'm happy to review this work because it's very relevant to what we do (Spark via Sedona Spark and DataFusion via SedonaDB), with the limitation that I don't have any experience with JNI (just general arrow-over-c). I do have a few questions about the approach before I get started. Notably, these are a few ways to avoid maintaining a pile of JNI (by exploiting other piles of JNI that other people have written already!) One approach is to use ADBC as the FFI bridge. As a TableProvider author, a thin layer of code (a few hundred lines) can get you a cdylib with the ADBC entrypoint defined, and on the Java side there is already a driver manager that can import the cdylib and "run a SQL query", getting one FFI_ArrowArrayStream per partition back. You have to figure out how to squeeze your scan options into the ADBC interface but you're already serializing/deserializing most of it. It might look like Another approach is to extract the FFI of the table provider you actually need and add it to At a high level I think my complaint about the approach in this stack is that you are proposing to build a cdylib that exports JNI entrypoints, whereas a cleaner approach (in my opinion) is to build a cdylib that exports entrypoints that just use the Arrow C Data/Stream interface and C types. That also has broader applicability to non-Java (i.e., can live in datafusion proper and get eyes/reviews from a wider audience). |
|
I missed the conversation on here before merging. Sorry about that. |
|
It’s my fault. I should have moved it to draft. I’ve been traveling and lost track |
|
All good - we'll need dedicated Rust and Java components no matter how this goes. I am busy trying to get SedonaDB 0.4 out this week but can prototype the |
|
@paleolimbot I started to churn on this and put up this PR: #110 I'm going to try digging in more tomorrow. I'll be out on vacation starting Thursday, so there's no rush on my side. |
This is part of a 6 PR stack. They build on each other, so review and merge in order — until the earlier parts merge, this PR's diff includes their changes too.
➤ 1. #104 — Cargo workspace +
native-commonextraction2. #105 —
datafusion-spark-bridgeRust SDK3. #106 — Spark connector Java SPI
4. #107 — Spark DataSource V2 connector (Scala)
5. #108 — Bridge scaffold generator
6. #109 — End-to-end examples
Purpose
This work enables exporting a DataFusion
TableProvideras a Spark DataSource.This PR prepares the build for the follow on work. We currently have a single rust crate, but with this work we switch to a workspace containing two crates. The follow on work will add a third crate specific to the Spark work.
We split off common JNI plumbing into the new crate
datafusion-jni-common. Both the existingdatafusion-jnicrate and the upcoming spark bridge will both depend on this new crate.We move the rust target to
rust-target/instead of being within thenative/targetfolder so that we have a single build directory for the workspace. We override the folder name to avoid maven deleting the folder unintentionally.This is a pure refactor — no behavior change.